In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
#Reading the dataset in a dataframe using Pandas
df = pd.read_csv('../data/master.csv')
#Print first observations
df.head()
Out[3]:
Throughout the Machine Learning part of this project we will be using scikit-learn, an open source machine learning library for the Python programming language.
We are starting with an admittedly absurd idea: let us fit a linear regression model that predicts a restaurant's score from the numerical features we attached to it, namely its Zip Code's median income, population, and home-ownership percentage:
In [4]:
# There are some NaN values in our numerics (UT Campus and ABIA)
# Let us remove rows from those zip codes from the DataFrame:
df = df[np.isfinite(df['Population'])]
# create X and y
feature_cols = ['Med_Income', 'Population', 'Home_Ownership']
X = df[feature_cols]
y = df.Score
# follow the usual sklearn pattern: import, instantiate, fit
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
In [5]:
# pair the feature names with the coefficients
list(zip(feature_cols, lm.coef_))
Out[5]:
In [6]:
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lm.score(X,y))
We are explaining 1% of the variance with our model. Definitely nothing to write home about... This was to be expected anyway, since the numerical data we have added pertain to a whole zip code, not each individual restaurant.
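As a quick sanity check (a small sketch, not part of the original analysis), we can confirm that each demographic column takes a single value per Zip Code, i.e. it carries no restaurant-level signal:
In [ ]:
# Sanity-check sketch: each demographic column should take exactly one
# value per Zip Code, confirming it carries no per-restaurant information.
for col in ['Med_Income', 'Population', 'Home_Ownership']:
    print(col, df.groupby('Zip_Code')[col].nunique().max())  # expect 1 for each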
In [8]:
# Let us check how many rows we are left with after excluding the UT Campus and ABIA areas:
len(df)
Out[8]:
As we have mentioned, the reason the above model was so terrible is quite straightforward: even though the numerical data in our DataFrame (Median Income, Population, Home Ownership Percentage) pertain to a whole Zip Code, we attached them to each individual row, as if every restaurant carried those features. This, of course, is absurd and wrong. However, we can still attempt a weak prediction if, instead of using all 15,957 rows, we dramatically shrink our dataset to... 34 rows, one for each Zip Code. We will then try to predict the average restaurant score per Zip Code: how do restaurants in a certain area do on average, given the demographics of its inhabitants?
In [9]:
average_scores = df.groupby('Zip_Code').mean()
len(average_scores)
Out[9]:
In [10]:
average_scores.head()
Out[10]:
It is important to remember that we should NOT expect to extract much useful information from the regression model we are about to apply. After all, we only have a tiny number of rows, each one corresponding to an entire Zip Code. However, we might still be able to build a model that reveals some broad trends, which would help us make, if not a prediction, at least an educated guess about the average score of all restaurants in a Zip Code with given demographic data.
Let us take the features one at a time:
In [17]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = average_scores['Med_Income'].values
y = average_scores['Score'].values
In [18]:
# Check the shapes of the X and y vectors:
print(X.shape)
print(y.shape)
In [19]:
# Reshape to get them to work when we fit the model:
X = X.reshape(34,1);
y = y.reshape(34,1);
In [20]:
# Fit linear regression model:
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
In [21]:
# And score it:
lm.score(X,y)
Out[21]:
This is not a very exciting score for our model; we are explaining 35.2% of the variance. We do have to take into account, however, that:
The linear regression part of this notebook is purely a toy problem, not expected to yield any respectable results. Therefore I will not bother with train/test splits and cross-validation techniques yet; even if those were to raise my model's score, they would not compensate for the lack of data.
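Purely for illustration, such a check would look something like the sketch below (assuming a scikit-learn version that provides sklearn.model_selection); with only 34 rows, every fold is tiny, which is exactly the problem:
In [ ]:
# Sketch only: cross-validated R^2 for the income-only model on the 34 zip-code averages.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("Cross-validated R^2: %.2f (+/- %.2f)" % (cv_scores.mean(), cv_scores.std()))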
In [22]:
# Visualization using Seaborn:
sns.lmplot(x="Med_Income", y="Score", data=average_scores);
In [23]:
print "The Pearson Correlation coefficient between Median Income and Average Score is {0}".\
format(np.corrcoef(average_scores['Med_Income'].values, average_scores['Score'].values)[1][0])
We can see from the scatterplot and the Pearson Correlation Coefficient that there is indeed a weak positive correlation between the two quantities. Recall that for Simple Linear Regression models like ours, the square of the Pearson Correlation Coefficient is equal to the R-squared parameter we have calculated above, i.e. the fraction of the variance in our data explained by our model.
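As a small sketch of that relationship, using the income-only model we just fit:
In [ ]:
# Sketch: for a simple linear regression, R^2 equals the squared Pearson r.
r = np.corrcoef(average_scores['Med_Income'].values,
                average_scores['Score'].values)[1][0]
print("Pearson r squared: %.3f" % (r ** 2))
print("Model R^2:         %.3f" % lm.score(X, y))  # the two should agree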
The process is exactly the same as above:
In [24]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = average_scores['Population'].values
y = average_scores['Score'].values
In [25]:
# Check the shapes of the X and y vectors:
print(X.shape)
print(y.shape)
In [26]:
# Reshape to get them to work when we fit the model:
X = X.reshape(34,1);
y = y.reshape(34,1);
In [27]:
# Fit linear regression model:
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
In [28]:
# And score it:
lm.score(X,y)
Out[28]:
I was expecting a much lower score for this model than for the previous one. Population is not really a good indicator of restaurants' health inspection scores, since the more populous Zip Codes (for instance, 78704) correspond to really diverse neighborhoods with a large variety of establishments of all kinds. This will end up being an important result of this entire project: Austin's areas are really diverse.
Of course, the arguments above against putting much faith in this model still stand: we have only 34 pairs of data points and we are overfitting.
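One quick way to eyeball that diversity claim (a sketch, not part of the original analysis) is to look at the spread of individual restaurant scores within the most populous Zip Codes:
In [ ]:
# Sketch: restaurant-level score spread in the most populous Zip Codes.
by_zip = df.groupby('Zip_Code')['Score'].agg(['count', 'mean', 'std'])
by_zip['Population'] = df.groupby('Zip_Code')['Population'].first()
by_zip.sort_values('Population', ascending=False).head()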
In [29]:
# Visualization using Seaborn:
sns.lmplot(x="Population", y="Score", data=average_scores);
In [30]:
print "The Pearson Correlation coefficient between Population and Average Score is {0}".\
format(np.corrcoef(average_scores['Population'].values, average_scores['Score'].values)[1][0])
It is quite interesting that the correlation between the two quantities is negative. Of course this could just mean that the really "small" Zip Codes correspond to rather affluent (and hence, sparsely populated) neighborhoods with very few restaurants that probably score really well in the inspection.
Another thing to take into account is that the slope of this "best fit" line is minuscule. Even if our model had any merit, the population of an imaginary Zip Code would have to reach almost... 450,000 people before the predicted score dropped to a failing 69!
In [36]:
print "For a predicted score: {0} (just below the cutoff), the population would have to be {1}".\
format(lm.predict(450000)[0][0], 450000)
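We can recover that figure by inverting the fitted line, score = intercept + slope × population, and solving for the population at which the predicted score drops to 69 (a small sketch using the coefficients printed above):
In [ ]:
# Sketch: solve 69 = intercept + slope * population for the population.
failing_score = 69
pop_needed = (failing_score - lm.intercept_[0]) / lm.coef_[0][0]
print("Population at which the fitted line reaches a score of 69: %.0f" % pop_needed)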
As above, this time with the Home Ownership percentage:
In [37]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = average_scores['Home_Ownership'].values
y = average_scores['Score'].values
In [38]:
# Reshape the X and y vectors to get them to work when we fit the model:
X = X.reshape(34,1);
y = y.reshape(34,1);
In [39]:
# Fit linear regression model:
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
In [40]:
# And score it:
lm.score(X,y)
Out[40]:
This is quite an interesting, even unexpected result. We were predisposed to believe that, since Median Income and Home Ownership Percentage are (or should be) strongly correlated, the percentage of home ownership would be a respectable indicator (if not a predictor) of how well restaurants do on health inspections. It seems it does not play such a role, though.
Austinites seem to prefer renting to buying, given the state of the real estate market, and some of the very affluent Zip Codes downtown have excellent restaurant scores but very few home owners. Of course, the strong arguments against any predictive model still stand: very few data points and overfitting.
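That presumed link between income and home ownership is easy to check directly on our 34 Zip Code averages (a quick sketch, not something computed in the original notebook):
In [ ]:
# Sketch: how correlated are Median Income and Home Ownership across Zip Codes?
r = np.corrcoef(average_scores['Med_Income'].values,
                average_scores['Home_Ownership'].values)[1][0]
print("Pearson r between Med_Income and Home_Ownership: %.3f" % r)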
In [41]:
# Visualization using Seaborn:
sns.lmplot(x="Home_Ownership", y="Score", data=average_scores);
This "best fit" line illustrates how disastrous this model is. Since the percentage of home owners in a Zip Code is bounded between 0 and 100%, the model "predicts" that all restaurants, regardless of Home_Ownership fraction, would score between 90 and 92, without taking into account uncertainty for the slope...
And for completeness:
In [42]:
print "The Pearson Correlation coefficient between Home Ownership Percentage and Average Score is {0}".\
format(np.corrcoef(average_scores['Home_Ownership'].values, average_scores['Score'].values)[1][0])
The three attempts at single-feature linear regression models have not really made us any wiser, even though they did at least reveal some broad trends in our data set, which we could loosely extrapolate to other urban areas with Austin's characteristics. (Good luck with that!)
To complete our toy problem, let us fit a multiple linear regression model to our data: we will try to predict the average health inspection score for a Zip Code when all three features are given as inputs.
In [43]:
feature_cols = ['Med_Income', 'Population', 'Home_Ownership']
In [44]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = average_scores[feature_cols]
y = average_scores['Score'].values
In [45]:
# Check the shapes of the X and y vectors:
print(X.shape)
print(y.shape)
In [46]:
y = y.reshape(34,1);
In [47]:
# Fit linear regression model:
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
In [48]:
# And score it:
lm.score(X,y)
Out[48]:
Now this would have been a very interesting plot, had it been possible to graph it in four dimensions! Our combined model has managed to explain almost 50% of the variance in our data, which concludes this section: there is no way to get accurate predictions of a restaurant's score unless we obtain other features that will help us make more educated guesses.
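Before moving on, it is worth noting that with only 34 observations and three features, even that ~50% figure flatters the model; a quick adjusted R-squared calculation (a sketch using the standard formula, not something computed in the original notebook) puts it into perspective:
In [ ]:
# Sketch: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n, p = X.shape            # 34 Zip Codes, 3 features
r2 = lm.score(X, y)
adj_r2 = 1 - (1 - r2) * (n - 1.0) / (n - p - 1)
print("R^2: %.3f, adjusted R^2: %.3f" % (r2, adj_r2))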
As we have discussed, this looks primarily like a classification problem. We are going to use the wealth of information we have for each restaurant (its location, its cuisine, even its address) to try to classify it into one of two or more categories. I will discuss this in more detail in the next section.